Experimental GGUF-2-PTE Converter #13266
Conversation
…workflow is shown. Functional but prone to lots of random whacky torch export errors
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13266. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 2ced5c5 with merge base c8a0706. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
torch_dtype = torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```
@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?
If so, I'm not sure this is really a converter in the sense that it doesn't preserve the quantization from GGUF.
But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.
We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):
```python
# pip install gguf numpy
import numpy as np
import gguf

# ---- helpers ----
def _fp16le_to_f32(buf_mv):
    """Decode a single little-endian fp16 value to float32."""
    return np.frombuffer(buf_mv, dtype="<f2", count=1).astype(np.float32)[0]

def _unpack_q4k_scale_min_codes(bytes12: memoryview):
    """Return two arrays (8,) of 6-bit integers for sub-block scales and mins."""
    b = np.frombuffer(bytes12, dtype=np.uint8)
    # Layout per llama.cpp ("Tensor Encoding Schemes" / get_scale_min_k4):
    #   0: EEAAAAAA  1: FFBBBBBB  2: GGCCCCCC  3: HHDDDDDD
    #   4: eeaaaaaa  5: ffbbbbbb  6: ggcccccc  7: hhdddddd
    #   8: eeeeEEEE  9: ffffFFFF 10: ggggGGGG 11: hhhhHHHH
    # A-D / a-d are the full 6-bit scales / mins for sub-blocks 0-3.
    # E-H / e-h (sub-blocks 4-7) are split: low 4 bits in bytes 8-11, and the
    # top 2 bits stored in the high bits of bytes 0-3 (scales) / 4-7 (mins).
    S0_3 = b[0:4] & 0x3F
    M0_3 = b[4:8] & 0x3F
    S4_7 = (b[8:12] & 0x0F) | ((b[0:4] >> 6) << 4)
    M4_7 = (b[8:12] >> 4) | ((b[4:8] >> 6) << 4)
    S = np.concatenate([S0_3, S4_7]).astype(np.float32)  # (8,)
    M = np.concatenate([M0_3, M4_7]).astype(np.float32)  # (8,)
    return S, M

def extract_q4k(gguf_path: str, tensor_name: str):
    """
    Returns:
        q_codes : (n_super, 256) uint8   -- 4-bit codes per superblock (values 0..15)
        scales  : (n_super, 8)   float32 -- per-sub-block scale (real units)
        mins    : (n_super, 8)   float32 -- per-sub-block min/offset (real units)
        d, dmin : (n_super,)     float32 -- super-scales used to decode the 6-bit fields
    Notes:
        - Each superblock covers 256 weights = 8 sub-blocks * 32 each.
        - Reconstruct weights for sub-block j: w = scales[i, j] * q - mins[i, j]
        - Zero-point (affine form): z = mins / scales (can be fractional)
    """
    r = gguf.GGUFReader(gguf_path)
    t = next(t for t in r.tensors if t.name == tensor_name)
    raw = memoryview(np.ascontiguousarray(t.data).reshape(-1).view(np.uint8))
    # Superblock layout (Q4_K):
    # [d fp16][dmin fp16][12B packed S/M codes][128B 4-bit codes]
    stride = 2 + 2 + 12 + 128  # 144 bytes
    assert len(raw) % stride == 0, "Unexpected Q4_K tensor byte length"
    n_super = len(raw) // stride

    d = np.empty(n_super, dtype=np.float32)
    dmin = np.empty(n_super, dtype=np.float32)
    S_all = np.empty((n_super, 8), dtype=np.float32)
    M_all = np.empty((n_super, 8), dtype=np.float32)
    Q_all = np.empty((n_super, 256), dtype=np.uint8)

    off = 0
    for i in range(n_super):
        # two fp16 super-scales
        d[i] = _fp16le_to_f32(raw[off:off + 2]); off += 2
        dmin[i] = _fp16le_to_f32(raw[off:off + 2]); off += 2
        # packed 6-bit sub-scales / sub-mins
        s12 = raw[off:off + 12]; off += 12
        S6, M6 = _unpack_q4k_scale_min_codes(s12)
        # realize to real units
        S_all[i, :] = d[i] * S6
        M_all[i, :] = dmin[i] * M6
        # 128 bytes => 256 4-bit codes. llama.cpp's dequantize_row_q4_K consumes
        # these in 32-byte chunks: the low nibbles of chunk c give sub-block 2c,
        # the high nibbles give sub-block 2c+1.
        codes_b = np.frombuffer(raw[off:off + 128], dtype=np.uint8).reshape(4, 32); off += 128
        q = np.empty((4, 64), dtype=np.uint8)
        q[:, :32] = codes_b & 0x0F
        q[:, 32:] = codes_b >> 4
        Q_all[i, :] = q.reshape(256)

    return Q_all, S_all, M_all, d, dmin

# ---- Example usage ----
# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# # Dequantize one superblock 'i', sub-block j (32 weights):
# i, j = 0, 3
# w_block = s[i, j] * q[i, j*32:(j+1)*32].astype(np.float32) - m[i, j]
# # Optional affine form zero-point:
# z_block = m[i, j] / s[i, j]
```
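As a small follow-on sketch (not part of the original snippet), a full-tensor dequantization in plain numpy that reuses `extract_q4k` above and assumes its output layout; it returns a flat float32 array in GGUF storage order, and the reshape back to the model's logical weight shape is deliberately left out:

```python
import numpy as np

def dequantize_q4k(gguf_path: str, tensor_name: str) -> np.ndarray:
    """Dequantize an entire Q4_K tensor to float32 (flat, superblock-major order)."""
    q, s, m, _d, _dmin = extract_q4k(gguf_path, tensor_name)
    n_super = q.shape[0]
    # View codes as (superblock, sub-block, weight) and apply per-sub-block params.
    codes = q.reshape(n_super, 8, 32).astype(np.float32)
    w = s[:, :, None] * codes - m[:, :, None]  # w = scale * code - min
    return w.reshape(n_super * 256)
```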
Now we don't currently have any quantized kernels that will handle floating point zeros (in XNNPACK or elsewhere), but I could quickly put up a patch to support that for our lowbit kernels in a day or two.
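To make that concrete, here is a hedged, illustrative-only recast of the extracted data into the int_data/scales/zeros form mentioned above (reusing the `q`, `s`, `m` names from the example usage); the fractional zero-points are exactly what existing kernels don't handle:

```python
import numpy as np

# q: (n_super, 256) uint8 4-bit codes; s, m: (n_super, 8) float32 from extract_q4k.
int_data = q                                  # unsigned 4-bit codes, 0..15
scales = np.repeat(s, 32, axis=1)             # per-weight scale, (n_super, 256)
zero_points = np.repeat(m / s, 32, axis=1)    # fractional zero-points, (n_super, 256)

# Affine reconstruction matches the scale/min form above:
#   scales * (int_data - zero_points) == s * q - m
w = scales * (int_data.astype(np.float32) - zero_points)
```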
Thanks for the example, the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.
I was imagining we could export a PTE file without weights, and plug in gguf weights at runtime, but that also requires some more work on export/runtime before it's possible.
Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (this is also mentioned in the docs).
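For anyone reproducing this, a quick check along these lines (with the repo id and GGUF file name as placeholders, standing in for the values used by the PR's loader) is enough to confirm the load-time dequantization:

```python
from transformers import AutoModelForCausalLM

model_id = "<hf-repo-with-gguf>"  # placeholder
filename = "<model>.gguf"         # placeholder

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
print({p.dtype for p in model.parameters()})  # expected: {torch.float32}
```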
As you've mentioned, it would be great to have some sort of a conversion module we route the model through once the GGUF has been loaded by HF.
What would be the best path forward for development? Do we want an RFC/some abstractions in this PR we can use to capture this process + any additional steps (e.g. dtype conversion)?

cc @swolchok on gguf-pte conversion
Are the models in the […]? Also CC @mergennachin
@swolchok Good point! Welp, I haven't done a detailed analysis on this, but I think it's largely dependent on the model architecture/operations within it - feel free to have a play around with changing the `model_id`. Perhaps there are some commonalities between what exports well and what doesn't. If we investigate this, it could help us understand the limitations of what is/isn't exportable.
Summary
This PR is not intended for merge. Instead, it is designed to demonstrate a potential method under which `.gguf` files can be converted to `.pte` by leveraging some of the existing `transformers` ecosystem.

The key idea is the following: load the `.gguf` file through the `transformers` library into a suitable auto class, then run the resulting model through the usual export path to produce a `.pte` (a rough sketch of this flow is shown below).
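A minimal sketch of that flow, assuming the standard `torch.export` / `executorch.exir.to_edge` / `to_executorch` path; the repo id, file name, and example inputs are placeholders rather than the exact code in this PR:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from executorch.exir import to_edge

model_id = "<hf-repo-with-gguf>"  # placeholder
filename = "<model>.gguf"         # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename).eval()

# torch.export traces the model with example inputs; a short prompt suffices here.
example_inputs = (tokenizer("Hello", return_tensors="pt").input_ids,)
exported = torch.export.export(model, example_inputs)

# Lower to an ExecuTorch program and write out the .pte file.
executorch_program = to_edge(exported).to_executorch()
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```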
Early Learnings & Limitations

Attached code generates a `.pte` model that is yet to be tested

The experiment in this PR converts `SmolLM2-135M-Instruct-Q8_0.gguf` into a `.pte` file. However, whether this model works as expected within the ExecuTorch runtime is still to be confirmed. This may also require some conversion of the `tokenizer` (I'm not too sure, but I'd be interested to know; it's probably in the docs somewhere). A rough runtime check is sketched below.
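A hedged smoke-test sketch, assuming the ExecuTorch Python runtime bindings (`executorch.runtime.Runtime`) are available in your build; both the API surface used here and the single-tensor token-id input are assumptions, and the token ids are placeholders:

```python
import torch
from executorch.runtime import Runtime

# Assumed API: load the .pte produced above and run its "forward" method.
runtime = Runtime.get()
program = runtime.load_program("model.pte")
method = program.load_method("forward")

input_ids = torch.tensor([[1, 2, 3]], dtype=torch.long)  # placeholder token ids
outputs = method.execute([input_ids])
print(type(outputs), len(outputs))
```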
Torch export errors can be... scary

Conceptually, I thought that running this experiment would not be too difficult. However, some of the setup issues around getting a reliable torch export for certain models proved to be quite challenging. This could just be due to my lack of knowledge of `torch.export`, but I think it also offers some insight from the perspective of a developer who wants to focus on having a smooth experience converting to `.pte` files. I also tried LFM-2 (simply changing the `model_id` and `filename`), and it's crazy how scary some of the errors became: complex traces rooted in various operators and potentially unsupported ops appeared in the logs. Of course, LFM-2 is quite a new model, so perhaps I gave it an unfair example, but nonetheless it could be interesting to see where the limit is for models that can comfortably be converted via this workflow.

cc @lucylq